ABSTRACT

Big data is becoming ever more ubiquitous, ranging over massive
video repositories, document corpora, image sets and Internet
routing history. Proximity search and clustering are two algorithmic
primitives fundamental to data analysis, but they suffer from the
“curse of dimensionality” on these gigantic datasets. A popular
attack on this problem is to convert object representations into
short binary codewords while approximately preserving near neighbor
structure. However, there has been limited research on constructing
codewords in the “streaming” or “online” settings often applicable
to this scale of data, where one may make only a single pass over
data too massive to fit in local memory.

In this paper, we apply recent advances in matrix sketching
techniques to construct binary codewords in both the streaming and
online settings. In our experiments, our method competes with or
outperforms several of the most popularly used algorithms, and we
prove theoretical guarantees on performance in the streaming setting
under mild assumptions on the data and randomness of the training
set.

1. INTRODUCTION 

Due to the overwhelming increase in the sheer volume of data being
generated every day, fundamental algorithmic primitives of data
analysis are being run on ever larger data sets. These primitives
include approximate nearest neighbour search [?], clustering [?],
low dimensional embeddings, and learning distributions from a
limited number of samples [?].

A prominent approach for handling gigantic datasets is to convert
object representations to short binary codewords such that similar
objects map to similar binary codes. Binary representation is widely
used in data analysis tasks. For example, Song et al. [?] gave an
algorithm for converting a large video dataset into a set of binary
hashes. Seo [?] proposed a binary hashing scheme for music
retrieval. Fergus, Weiss and Torralba [?] employed a spectral
hashing scheme for labeling gigantic image datasets in a
semi-supervised setting. Jurie and Triggs [?] used binary feature
vectors for visual recognition of objects inside images. Guruswami
and Sahai [?] gave an embedding into Hamming space that reduces
multi-class learning to an easier binary classification problem.

Codewords as a succinct representation of data serve multiple
purposes: 1) they can be used for dimensionality reduction; 2) they
can emphasize user-desired distance thresholds, i.e. encode data
points such that near neighbors become much closer in Hamming space,
rather than giving a simple proportionate embedding of distances;
and 3) they allow the use of efficient tree based search data
structures and enable the use of nearest neighbor techniques in
Hamming space. (For more on how to conduct such searches quickly in
Hamming space, see for instance the work by Norouzi, Punjani and
Fleet [?] or by Esmaeili, Ward and Fatourechi [?].)

Sometimes these codes may be found trivially, e.g. if a dataset is
already described by binary features, or is partitioned in a
locality preserving and hierarchical manner. Where we are not so
fortunate, we need to learn them with the help of a constructive
similarity function. For instance, unsupervised methods derive
codewords from feature vectors in Euclidean space, or construct them
from a data independent affinity matrix. Supervised methods [?], on
the other hand, take additional contextual information into account
and use a similarity notion that is semantically meaningful for
codewords, e.g. two documents are similar if they are about the same
topic, or two images are similar if they contain the same objects
and colors.

On a meta-level, any binary coding scheme should satisfy the
following three properties to be considered effective:

1. The codes should be short so that we can store large datasets 
in memory. 

2. Codes should be similarity-preserving; i.e., similar data points 
should map to similar binary codes while far data points should 
not collapse to small neighborhoods. 

3. The learning algorithm should efficiently compute codes for 
newly inserted points without having to recompute the entire 
codebook. 

The need to simultaneously satisfy all three constraints above makes 
learning binary codes a challenging problem. 

Broadly speaking, binary coding techniques fall into two
categories. The first class is the family of techniques, referred to
as symmetric, which binarize both the datapoints of a dataset or
database and the query points, usually according to the same hashing
scheme. This class includes locality sensitive hashing (LSH) [?],
spectral hashing [?], locality sensitive binary codes [?], Iterative
Quantization (ITQ) [?] and semi-supervised hashing [?] techniques.
In contrast, the second class of methods, namely asymmetric
algorithms, binarize only the data points and not the query points,
e.g. [?]. These methods achieve higher accuracy due to greater
precision in the query description, yet still have the storage and
efficiency gains from binarizing the ground dataset.


2. BACKGROUND AND NOTATION 

First, we briefly review some notation. We use lower case letters to
denote functions, e.g. $w(x)$, and upper case letters to represent
matrices, e.g. $W$. An $n \times d$ matrix $A$ can be written as a
set of $n$ rows as $[A_{1,:}; A_{2,:}; \ldots; A_{n,:}]$, where each
row $A_{i,:}$ is a datapoint of length $d$. Equivalently, this
matrix can be written as a set of $d$ columns as
$[A_{:,1}, A_{:,2}, \ldots, A_{:,d}]$. The element at row $i$ and
column $j$ of matrix $A$ is denoted by $A_{i,j}$. The Frobenius norm
of a matrix $A$ is defined as
$\|A\|_F = \sqrt{\sum_{i=1}^{n} \|A_{i,:}\|^2}$, where
$\|A_{i,:}\|$ is the Euclidean norm of $A_{i,:}$. Let $A_k$ refer to
the best rank $k$ approximation of $A$, specifically
$A_k = \arg\min_{C:\ \mathrm{rank}(C) \le k} \|A - C\|_F$.

The singular value decomposition of $A \in \mathbb{R}^{n \times d}$,
written $\mathrm{svd}(A)$, produces three matrices $[U, \Sigma, V]$
so that $A = U \Sigma V^T$. Matrices
$U \in \mathbb{R}^{n \times n}$ and $V \in \mathbb{R}^{d \times d}$
are orthogonal and their columns are the left singular vectors and
right singular vectors, respectively. Matrix
$\Sigma \in \mathbb{R}^{n \times d}$ is all 0s except for the
diagonal entries
$\{\Sigma_{1,1}, \Sigma_{2,2}, \ldots, \Sigma_{r,r}\}$, the singular
values, where $r \le d$ is the rank. Note that
$\Sigma_{j,j} \ge \Sigma_{j+1,j+1}$ for all $1 \le j \le r - 1$, the
spectral norm of a matrix is $\|A\|_2 = \Sigma_{1,1}$, and
$\Sigma_{j,j} = \|A V_{:,j}\|$ describes the norm along direction
$V_{:,j}$. The numeric rank of a matrix $A$ is defined as
$\|A\|_F^2 / \|A\|_2^2$, and the trace of a square matrix
$M \in \mathbb{R}^{n \times n}$ is
$\mathrm{Tr}(M) = \sum_{i=1}^{n} M_{i,i}$.

For a symmetric matrix $A \in \mathbb{R}^{n \times n}$, the eigen
decomposition of $A$ is $\mathrm{eig}(A) = U \Lambda U^T$, where
$U \in \mathbb{R}^{n \times n}$ contains the eigenvectors as
columns, and $\Lambda \in \mathbb{R}^{n \times n}$ is a diagonal
matrix containing the eigenvalues
$\{\Lambda_{1,1}, \Lambda_{2,2}, \ldots, \Lambda_{n,n}\}$ in
non-increasing order. Finally, the expected value of a matrix is
defined as the matrix of expected values, i.e.

$$
\mathbb{E}[A] =
\begin{pmatrix}
\mathbb{E}[A_{1,1}] & \cdots & \mathbb{E}[A_{1,d}] \\
\vdots & \ddots & \vdots \\
\mathbb{E}[A_{n,1}] & \cdots & \mathbb{E}[A_{n,d}]
\end{pmatrix}.
$$


2.1 Related Works 

One of the basic and most popular binary encoding schemes is
“Locality Sensitive Hashing” (LSH) [?], which uses random
projections to embed data into a lower dimensional space. This is
done by employing a class of functions called locality-sensitive
hash functions, under which similar points collide with high
probability. A family of hash functions $H$ is called
$(r, cr, P_1, P_2)$-sensitive if for any two points
$p, q \in \mathbb{R}^d$ and a hash function $h$ drawn at random from
$H$, the following two properties hold:

1. If $\|p - q\| \le r$ then $\Pr_H[h(p) = h(q)] \ge P_1$.

2. If $\|p - q\| \ge cr$ then $\Pr_H[h(p) = h(q)] \le P_2$.

Here $\Pr_H$ denotes the probability of an event under the family of
hash functions $H$, and $h(p)$ is the hashed value of point $p$
under hash function $h$. Note that in this definition $r > 0$ is a
threshold on distance and $c$ is an approximation ratio, and in
order for an LSH family to be useful it should be that $P_1 > P_2$.

LSH is a data independent method and can be done in the streaming
setting, as it does not need to store data points, and hashing or
projection can be done on the fly. It is folklore that for random
datasets LSH is near optimal, but in practice it is generally
outperformed by methods that use the spectrum of the data. To
simplify somewhat, $k$ bit binary codewords of LSH can be assigned
to a point in $\mathbb{R}^d$ by taking its dot product with a
collection of $k$ random vectors, and assigning each bit as 0 or 1
according to the sign of the value obtained [?].
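
The sign-of-random-projection construction just described can be
sketched as follows. This is a minimal illustration rather than any
particular published implementation: the function names `lsh_codes`
and `hamming`, the use of Gaussian random directions, and the toy
points are all assumptions made for the example.

```python
import numpy as np


def lsh_codes(X, k, seed=0):
    """Assign a k-bit code (entries 0/1) to every row of X.

    Each bit is the sign of the dot product with one random direction.
    Gaussian directions are an assumption of this sketch.
    """
    rng = np.random.default_rng(seed)
    R = rng.standard_normal((X.shape[1], k))  # k random vectors in R^d
    return (X @ R >= 0).astype(np.uint8)


def hamming(a, b):
    """Number of bit positions in which two codes differ."""
    return int(np.count_nonzero(a != b))


# Toy data: a point, a slightly perturbed copy, and its mirror image.
rng = np.random.default_rng(1)
x = rng.standard_normal(50)
near = x + 0.01 * rng.standard_normal(50)
far = -x
codes = lsh_codes(np.vstack([x, near, far]), k=64)
```

Because the projection matrix does not depend on the data, the same
`R` can be applied to each point as it arrives, which is why the
scheme needs no stored data in the stream.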

One of the most famous binary encoding schemes is “Spectral
Hashing” (SH) [?]. If $W \in \mathbb{R}^{n \times n}$ is the
similarity matrix and $Y \in \mathbb{R}^{n \times k}$ is the binary
coding matrix, with $k$ being the length of codewords, then this
method formulates the problem as minimizing
$\sum_{i,j} W_{i,j} \|Y_{i,:} - Y_{j,:}\|^2$ subject to
$Y(i,j) \in \{-1, 1\}$, a balance constraint, i.e.
$\sum_{j=1}^{k} Y_{i,j} = 0$ for each binary codeword $Y_{i,:}$, and
an evenly distributed constraint that enforces each bit to be evenly
distributed on $+1$ and $-1$ over the dataset. It is not too hard to
show that this optimization is equivalent to minimizing
$\mathrm{Tr}(Y^T (D - W) Y)$, where $Y \in \mathbb{R}^{n \times k}$
is the matrix containing codewords and $D$ is the degree matrix with
$D_{i,i} = \sum_{j=1}^{n} W_{i,j}$. However, due to the binary
constraint $Y(i,j) \in \{-1, 1\}$, this problem is NP hard, so
instead the authors threshold a spectral relaxation whose solution
is the bottom $k$ eigenvectors of the graph Laplacian matrix
$L = D - W \in \mathbb{R}^{n \times n}$. This, however, provides a
solution only for training datapoints. In order to extend it to
out-of-sample points, they assume datapoints are sampled from a
separable probability distribution $p(x)$; using the fact that graph
Laplacian eigenvectors converge to the Laplace-Beltrami
eigenfunctions of manifolds, they set thresholded eigenfunctions as
codewords. However, they only examine the simple case of a
multidimensional uniform distribution, or box shaped data, as these
eigenfunctions are well-studied.

In [?], Fergus et al. extended their previous work [?] to any
separable distribution, i.e. any distribution $p(x)$ with a product
form. They consider semi-supervised learning in a graph setting,
where a labeled dataset of input-output pairs
$(X_m, Y_m) = \{(x_1, y_1), \ldots, (x_m, y_m)\}$ is given, and they
need to label a larger set $X_u = \{x_{m+1}, \ldots, x_n\}$ of
unlabelled points. The authors form the graph of all datapoints
$X_m \cup X_u$, where vertices represent datapoints and edges are
weighted with a Gaussian function
$W_{i,j} = \exp(-\|x_i - x_j\|^2/\sigma)$. The goal is to find
functions $f$ which agree with the labeled data but are also smooth
with respect to the graph; therefore they formulate the problem as
minimizing the error function
$J(f) = f^T L f + \lambda \sum_{i=1}^{m} (f(i) - y_i)^2$, where
$f(i)$ is the embedding of the $i$-th point and $\Lambda$ is a
diagonal matrix whose diagonal elements are
$\Lambda_{i,i} = \lambda$ if $x_i$ is a labeled point and
$\Lambda_{i,i} = 0$ otherwise. Note that $f^T L f$ is the smoothness
term defined on the entire graph Laplacian as
$f^T L f = \frac{1}{2} \sum_{i,j} W_{i,j} (f(i) - f(j))^2$, and
$\lambda \sum_{i=1}^{m} (f(i) - y_i)^2$ represents the loss on the
labeled data. Similar to their previous work [?], the authors
approximate the eigenvectors of $L$ by eigenfunctions of the
Laplace-Beltrami operator defined on the probability distribution
$p$.

Finally, in the most recent work of this series, “Multidimensional
Spectral Hashing” (MDSH) [?], Weiss et al. introduced a new
formulation for learning binary codes: unlike other methods that
minimize the Hamming distance $\|Y_{i,:} - Y_{j,:}\|$, MDSH
approximates the original affinity $W_{i,j}$ with the weighted
Hamming affinity $Y_{i,:} \Lambda Y_{j,:}^T$, where
$\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_k)$ gives a
weight to each bit. The authors show that the best binary codes are
obtainable by performing a binary matrix factorization of the
affinity matrix, with the optimal weights given by the singular
values.

SSH and MDSH can be adapted to the streaming setting, but both are
unsatisfactory in that neither directly addresses approximating the
initial matrix optimization. Moreover, the hashing functions
(eigenfunctions) they learn are wholly determined by the initial
training set and do not adapt as more points are streamed in.

In another line of work, authors formulate the problem as an
iterative optimization. In [?], Gong and Lazebnik suggest the
“Iterative Quantization” (ITQ) algorithm, an iterative approach
based on an alternating minimization scheme that first projects
datapoints onto the top $k$ right singular vectors of the data
matrix, and then takes the sign of the projected vectors to produce
binary codes. The authors show that if we consider projected
datapoints as vectors in a $k$-dimensional binary hypercube
$C \in \{-1, 1\}^k$, then the sign of the vector entries in each
dimension is determined by the closest vertex of the hypercube along
that dimension. As rotating this hypercube does not change the
codes, they alternately minimize the quantization loss
$Q(B, R) = \|B - VR\|_F^2$, where $V$ holds the projected
datapoints, by fixing one of the two variables, $B$ the binary codes
or $R$ the rotation matrix, and solving for the other. They show
that in practice repeating this for at most 50 iterations beats some
well-known methods including [?] and spectral hashing [?].
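
The alternating minimization described above can be sketched as
follows. This is an illustrative reading of the scheme, not the
authors' code: the function name `itq`, the random orthogonal
initialisation, and the toy data are assumptions. The rotation
update is the standard orthogonal Procrustes solution obtained from
an SVD.

```python
import numpy as np


def itq(V, n_iter=50, seed=0):
    """Alternately minimise ||B - V R||_F^2 over codes B and rotation R.

    V is an n x k matrix of projected, zero-centred datapoints.
    """
    rng = np.random.default_rng(seed)
    k = V.shape[1]
    # Random orthogonal starting rotation (an assumption of this sketch).
    R, _ = np.linalg.qr(rng.standard_normal((k, k)))
    losses = []
    for _ in range(n_iter):
        # Fix R: the best +/-1 codes are the signs of the rotated data.
        B = np.where(V @ R >= 0, 1.0, -1.0)
        # Fix B: the best rotation solves an orthogonal Procrustes problem.
        P, _, Qt = np.linalg.svd(V.T @ B)
        R = P @ Qt
        losses.append(np.linalg.norm(B - V @ R, "fro") ** 2)
    B = np.where(V @ R >= 0, 1.0, -1.0)
    return B, R, losses


# Toy usage: centre the data and project onto the top 4 right singular vectors.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 8))
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
V = Xc @ Vt[:4].T
B, R, losses = itq(V)
```

Each half-step is an exact minimiser with the other variable held
fixed, so the recorded loss cannot increase from one iteration to
the next.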

Heo et al. [?] present an iterative scheme that partitions points
using hyperspheres. Specifically, the algorithm places $k$ balls,
such that the $i$-th bit of a point is 1 if the point is contained
in the $i$-th ball and 0 otherwise. At each step of the process, if
the intersection of any two balls contains too many points, a
repulsive force is applied between them, whereas if the intersection
contains too few, an attractive force is applied. This minimization
continues until a reasonably balanced number of the points are
contained in each hypersphere. Both these iterative algorithms seem
difficult to adapt to a streaming setting, in the sense that the
hash functions are expensive to learn on a training set and not
easily updated.

In the supervised setting, Quadrianto et al. [?] present a
probabilistic model for learning and extending binary hash codes.
They assume the input dataset follows certain simple and well
studied probability distributions, and that supervision is provided
in terms of labels indicating which points are neighbors and which
are far. Under these constraints, they may train a latent feature
model to extend binary hash codes in the streaming setting as new
data points are provided.

In the broader context, the line of work most comparable to our
problem is matrix sketching in the stream. Although there has been a
flurry of results in this direction [?], we mention those which are
most related to our current work. In [?], Drineas and Mahoney
approximate a gram matrix $G \in \mathbb{R}^{n \times n}$ by
sampling $s$ columns (datapoints) of an input matrix
$A \in \mathbb{R}^{d \times n}$ proportional to the squared norm of
the columns. They approximate $G$ with
$\tilde{G}_k = C W_k^{+} C^T$, where
$C \in \mathbb{R}^{n \times s}$ is the gram matrix between the $n$
datapoints and the $s$ sampled points, and
$W_k \in \mathbb{R}^{s \times s}$ is the best rank $k$ approximation
to $W$, where $W$ is the gram matrix between sampled points. They
need to sample $O(k/\epsilon^4)$ columns to achieve the Frobenius
error bound
$\|G - \tilde{G}_k\|_F \le \|G - G_k\|_F + \epsilon \sum_{i=1}^{n} G_{i,i}^2$,
and need to sample $O(k/\epsilon^2)$ columns to get the spectral
error bound
$\|G - \tilde{G}_k\|_2 \le \|G - G_k\|_2 + \epsilon \sum_{i=1}^{n} G_{i,i}^2$.
Their algorithm needs $O(d/\epsilon^2 + 1/\epsilon^4)$ space and has
a running time of $O(nd + n/\epsilon^2 + 1/\epsilon^4)$ on the
training set. The update time for any future datapoint (or test
point) is $O(d/\epsilon^2 + 1/\epsilon^4)$.

The state-of-the-art matrix sketching technique is the FD algorithm,
first introduced by Liberty [?] and then reanalyzed by Ghashami and
Phillips [?]. FD maintains a deterministic, small space sketch for
an input matrix and can be easily incrementally updated in the
stream. In fact, for any input matrix
$A \in \mathbb{R}^{n \times d}$, FD maintains a sketch
$B \in \mathbb{R}^{\ell \times d}$ with $\ell = 2/\epsilon$ rows,
achieves the error bound
$\|A^T A - B^T B\|_2 \le \epsilon \|A\|_F^2$, and runs in time
$O(nd/\epsilon)$. It is shown by Woodruff that the approximation
quality is optimal [?].
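
A minimal sketch of such an incrementally updated sketch is given
below. It uses the simple variant that shrinks by the smallest
squared singular value whenever the sketch fills up; the function
name `frequent_directions` and the random test matrix are
assumptions of the example, and the code assumes the sketch has no
more rows than the data has columns.

```python
import numpy as np


def frequent_directions(A, ell):
    """Sketch the rows of A (n x d) into B (ell x d); assumes ell <= d."""
    d = A.shape[1]
    B = np.zeros((ell, d))
    for row in A:
        free = np.flatnonzero(~B.any(axis=1))  # indices of all-zero rows
        B[free[0]] = row
        if free.size == 1:
            # The sketch is now full: subtract the smallest squared
            # singular value from all of them, which zeroes the last row.
            _, s, Vt = np.linalg.svd(B, full_matrices=False)
            s = np.sqrt(np.maximum(s ** 2 - s[-1] ** 2, 0.0))
            B = s[:, None] * Vt
    return B


rng = np.random.default_rng(0)
A = rng.standard_normal((200, 20))
ell = 10
B = frequent_directions(A, ell)
err = np.linalg.norm(A.T @ A - B.T @ B, 2)
bound = 2 * np.linalg.norm(A, "fro") ** 2 / ell  # epsilon = 2 / ell
```

Every shrink step removes the same amount from each of the `ell`
squared singular values, so the total spectral error is paid for
`ell` times over by the decrease in Frobenius mass; this is the
reason the error stays below the quoted bound.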

2.2 Our Result 

In this paper, we focus on finding codewords for a dataset
$S \subset \mathbb{R}^d$ given in a stream. We consider an
unsupervised setting where mutual similarity between datapoints is
induced by the Gaussian kernel function
$w(p, q) = \exp(-\|p - q\|^2/\sigma^2)$ rather than any contextual
information. We develop a reasonable model of data under two
assumptions:

1. sparsity, which enforces that data similarity is not dominated by
“near-duplicates”.

2. bounded doubling dimension, i.e. the data has a low-dimensional
structure. This assumption is widely used as “effective
low-dimension” in Euclidean near neighbor search problems [?, 30],
and corresponds well with the existence of a good binary
codebook.*

Under this model, we propose the “Streaming Spectral Binary Coding”
(SSBC) algorithm, which builds on FrequentDirections, and show that
if the training set is a “good representor” of the stream, i.e. the
stream is in random order, then one can accurately update the
important directions (eigenvectors) of the weight matrix in a
stream. These vectors are then used to construct the desired
codewords.

In fact, as we show later, our technique works in both the streaming
and online settings, achieving
$O((n/\epsilon)\,\mathrm{polylog}\,n)$ space in the former setting
and $O((k/\epsilon)\,\mathrm{polylog}\,n)$ space in the latter. Note
that both bounds are much smaller than $\Omega(n^2)$, the space
required to store the similarity matrix.

Our starting matrix optimization formulation is closely adapted from
those posed in this line of work by Fergus, Weiss and Torralba.
However, our approach to solving the problem and to out-of-sample
extension differs fundamentally from previous tactics of using
functional approximation methods and learned eigenfunctions. We
maintain a sketch of the weight matrix instead, and adjust it during
the course of the stream. While known functional analysis techniques
rely on assumptions on the data distribution (in particular that it
is drawn from a separable distribution), we argue that solving for
the matrix approximation directly addresses the original
optimization problem without such restrictions, thereby achieving
the superior accuracy our experiments demonstrate.

3. SETUP AND ALGORITHM 

In this section, we first set up the matrix optimization problem
that is the starting point of the work by Weiss, Fergus and Torralba
[?]; we then describe our algorithm, “Streaming Spectral Binary
Coding” (SSBC), for approximating binary codewords in a stream or
online setting.

3.1 Model and Setup 

We denote the input dataset as $S \in \mathbb{R}^{n \times d}$,
containing $n$ datapoints in $\mathbb{R}^d$, and represent binary
codes as $Y \in \mathbb{R}^{n \times k}$, where
$k \in \mathbb{Z}^{+}$ is a parameter specifying the length of
codewords.

We define the affinity or similarity between datapoints $S_{i,:}$
and $S_{j,:}$ as
$w(S_{i,:}, S_{j,:}) = \exp(-\|S_{i,:} - S_{j,:}\|^2/\sigma^2)$,
where $\sigma$ is a parameter set by the user, corresponding to a
threshold between “near” and “far” distances. Since codewords are
vectors with $\pm 1$ entries, one can write
$\|Y_{i,:} - Y_{j,:}\|^2 = 2k - 2\,Y_{i,:} Y_{j,:}^T$, and match the
Hamming affinity $Y_{i,:} Y_{j,:}^T$ with $w(S_{i,:}, S_{j,:})$
instead of minimizing the Hamming distance. Similar to [?], we
define a diagonal weight matrix
$\Lambda = \mathrm{diag}(\Lambda_{1,1}, \ldots, \Lambda_{k,k})$ to
give an importance weight $\Lambda_{j,j}$ to the $j$-th bit of the
codewords. Therefore we formulate the problem as:

$$
(Y^*, \Lambda^*)
= \arg\min_{Y_{i,:} \in \{\pm 1\}^k,\ \Lambda}
  \sum_{i,j} \left( w(i,j) - Y_{i,:} \Lambda Y_{j,:}^T \right)^2
= \arg\min_{Y,\ \Lambda} \| W - Y \Lambda Y^T \|_F^2 .
$$
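
The identity relating squared distance and inner product for
$\pm 1$ codewords, which is what allows Hamming distance to be
replaced by Hamming affinity, can be checked numerically. The
snippet below is only an illustration; the code length and the
random codewords are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
k = 16
yi = rng.choice([-1, 1], size=k)  # two arbitrary +/-1 codewords
yj = rng.choice([-1, 1], size=k)

sq_dist = int(np.sum((yi - yj) ** 2))       # squared Euclidean distance
affinity = int(yi @ yj)                     # Hamming affinity (inner product)
hamming = int(np.count_nonzero(yi != yj))   # number of differing bits
```

Each differing bit contributes $(\pm 2)^2 = 4$ to the squared
distance and lowers the inner product by 2, so `sq_dist` equals both
`2 * k - 2 * affinity` and `4 * hamming`.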

This optimization problem is solvable by a binary matrix
factorization of the affinity matrix $W$. As discussed in [?], the
$\pm 1$ binary constraint makes this problem computationally
intractable, but a relaxation to real numbers results in a standard
matrix factorization problem that is easily solvable. If
$W = U \Lambda U^T$ is the eigen decomposition of $W$, then the
$i$-th row of $U_k \in \mathbb{R}^{n \times k}$ provides a codeword
of length $k$ for the $i$-th datapoint, which can be easily
translated into a binary codeword by taking the sign of its entries.
The resulting binary codeword will be an approximation to the
solution of the binary matrix factorization.

* A good binary codebook is roughly equivalent to a low distortion
embedding into a low-dimensional Hamming space.
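
A small batch (non-streaming) sketch of this relaxation is shown
below: build the Gaussian affinity matrix, take its eigen
decomposition, and threshold the leading eigenvectors. The function
name `spectral_codes`, the toy points, and the kernel width are
assumptions of the example.

```python
import numpy as np


def spectral_codes(W, k):
    """Return +/-1 codes from the k leading eigenvectors of symmetric W."""
    vals, vecs = np.linalg.eigh(W)        # eigenvalues in ascending order
    top = np.argsort(vals)[::-1][:k]      # indices of the k largest
    U_k = vecs[:, top]
    Y = np.where(U_k >= 0, 1, -1)         # relax, then take signs
    return Y, vals[top]


rng = np.random.default_rng(0)
S = rng.standard_normal((10, 2))                       # toy datapoints
sq = ((S[:, None, :] - S[None, :, :]) ** 2).sum(axis=-1)
W = np.exp(-sq / 1.0)                                  # Gaussian affinity
Y, lam = spectral_codes(W, k=2)
```

This version needs the full $n \times n$ matrix in memory, which is
exactly the cost the streaming algorithm of the next subsection is
designed to avoid.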

We consider solving the binary encoding problem in two settings,
“streaming” and “online”, where in both models one datapoint arrives
at a time, is processed quickly and is not read again. In the
streaming setting, we output all binary codewords at the end of the
stream, while in the online setting, we are obliged to output the
binary codeword of the current datapoint before seeing the next
datapoint. Space usage is highly constrained in both models, so we
cannot store the entire weight matrix $W$ (of size $\Omega(n^2)$)
nor even the dataset itself (of size $O(nd)$).

Below, we specify the assumptions we make in our data model for the
purposes of theoretical analysis. However, we note that our
experiments show strong results without enforcing any restrictions
on the datasets we consider.

1. Our first assumption is “sparsity”, namely that no two points $p$
and $q$ are asymptotically close to each other. Specifically, that
$\|p - q\| \ge 0.1\,\sigma/\log n$ for all distinct points
$p, q \in S$, where $\sigma$ is the threshold distance parameter of
our Gaussian kernel. When our data is being analyzed for
clustering/near neighbor purposes, this condition implies that
identical points have either been removed or combined into a single
representative point.

2. Our second assumption is that the data has bounded doubling
dimension $d_0$. Namely, that a ball $B$ of radius $r$ contains at
most $O\big((r/\epsilon)^{d_0}\big)$ points spaced at distance at
least $\epsilon$. This is a standard model in the algorithms
community for modeling data drawn from a low dimensional manifold.
It is also intuitively compatible with the existence of a good
representation of our data by $k$-bit codewords for bounded $k$, as
binary encoding is simply an embedding into $k$-dimensional Hamming
space.

3.2 Streaming Binary Coding Algorithm 

Our method, which we refer to as “SSBC”, is described in Algorithm
3.1. SSBC takes three input values $S_{train}$, $S_{test}$ and $k$,
where $S_{train}$ is a small training set sampled uniformly at
random from the underlying distribution of data, e.g. $\mu$. We
denote the size of $S_{train}$ by $|S_{train}| = m$, and we assume
$m > \mathrm{polylog}(n)$. For ease of analysis, wherever we come
across some $\mathrm{polylog}(n)$ to a constant exponent, we assume
the term to be smaller than $m$. On the other hand, $S_{test}$ is a
potentially unbounded set of data points coming from the same
distribution $\mu$. Even though $S_{test}$ can be unbounded, for the
sake of analysis we denote the total number of datapoints in the
union of both sets as $n = |S_{train}| + |S_{test}|$. The value
$k > 0$ is the length of the codewords we seek.

The algorithm maintains a small sketch $B$ with only
$\ell \ll m \ll n$ rows. For each datapoint $p \in S_{train}$, SSBC
computes its (Gaussian) affinity with all points in $S_{train}$,
outputs an $m$ dimensional vector $w_p$ as the result, and inserts
it into $B$. Once $B$ is full, SSBC takes the svd of $B$
($[U, \Sigma, V] = \mathrm{svd}(B)$), subtracts off the smallest
singular value squared, i.e. $\Sigma_{\ell,\ell}^2$, from the
squares of all singular values, and reconstructs $B$ as
$B = \sqrt{\Sigma^2 - \Sigma_{\ell,\ell}^2 I_\ell}\; V^T$. This
results in zeroing out the last row of $B$, making space for
processing the next upcoming train datapoint. Note that after
processing $S_{train}$, the matrix
$V \in \mathbb{R}^{m \times \ell}$ contains an $\ell$-dimensional
approximation to the similarity structure of the train set. As we
observe, SSBC employs the FD algorithm [?] to process affinity
vectors in a streaming manner; instead of referring to FD, we
included its pseudocode completely in Algorithm 3.1. As many
similarity measures can be used to capture the affinity between
datapoints, SSBC uses the Gaussian affinity
$w = \exp(-\|q - p\|^2/\sigma)$, where $\sigma$ is a parameter
denoting the average near neighbor distance we care about. This
function is called in subroutine 3.2 to measure the affinity between
any test point and all train datapoints.

At any point in time, we can get the binary codeword of any
datapoint $q \in S$ by first computing its affinity with
$S_{train}$, getting the vector $w_q$ as output, and multiplying it
by the right singular vectors. More specifically, if $y_q$ denotes
the binary codeword of $q$, then
$y_q = \mathrm{sign}(w_q \times V) \in \mathbb{R}^{\ell}$ gives an
$\ell$-length codeword. To get a codeword of length $k$, we truncate
$V$ to its first $k$ columns, $V_k \in \mathbb{R}^{m \times k}$.


Algorithm 3.1 Streaming Spectral Binary Coding (SSBC)

Input: $S_{train}, S_{test} \subset \mathbb{R}^d$, $k \in \mathbb{Z}^{+}$ as length of codeword
Define $S = [S_{train}; S_{test}]$, $m = |S_{train}|$ and $n = |S|$
Set $\ell = \lceil k + k/\epsilon \rceil$ as sketch size
Set $B \in \mathbb{R}^{\ell \times m}$ to full zero matrix
for $t \in [1 : n]$ do
    $w$ = GaussianAffinity($S_{t,:}$, $S_{train}$)
    Insert $w$ into a full zero row of $B$
    if $B$ has no full zero rows then
        $[U, \Sigma, V] \leftarrow \mathrm{svd}(B)$
        $\Sigma' = \sqrt{\Sigma^2 - \Sigma_{\ell,\ell}^2 I_\ell}$
        $B = \Sigma' V^T$
Return $B$


A notable point about SSBC is that it can construct binary codewords
on the fly in an online manner, i.e. using the current iteration's
matrix $V$ to generate the binary codeword for the current
datapoint. As we show in Section 4, this leads to the small space
usage of $O(\ell m) = O(\frac{1}{\epsilon}\,\mathrm{polylog}(n))$.
Clearly, SSBC can generate all codewords at the end of the stream
too (streaming setting); in that case it needs to store all $w$
vectors and uses the final matrix $V$ to construct codewords. Space
usage in the streaming setting is
$O(nm + \ell m) = O((1/\epsilon + n)\,\mathrm{polylog}(n))$. The
update time (or test time) in both models is
$O(md + d\ell) = O(\frac{d}{\epsilon}\,\mathrm{polylog}(n))$.


Algorithm 3.2 Gaussian Affinity

Input: $q \in \mathbb{R}^d$ as a test point, $S_{train} \subset \mathbb{R}^d$
Define $\sigma$ as the similarity threshold between points in $S_{train}$
Set $w \in \mathbb{R}^m$ to zero vector, where $m = |S_{train}|$
Set $i = 0$
for $p$ in $S_{train}$ do
    $w[i] = \exp(-\|q - p\|^2/\sigma)$
    $i$++
Return $w$
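
Algorithms 3.1 and 3.2 can be put together in a short NumPy sketch
of the streaming setting. This is an illustrative reading of the
pseudocode, not the authors' Matlab implementation: the function
names, the toy data, and the parameter values are assumptions, as is
the choice to obtain the final $V$ from one last SVD of the returned
sketch. The code also assumes the sketch size does not exceed the
training set size.

```python
import numpy as np


def gaussian_affinity(q, S_train, sigma):
    """Algorithm 3.2: affinity of q to every training point."""
    return np.exp(-np.sum((S_train - q) ** 2, axis=1) / sigma)


def ssbc(S_train, S_test, k, eps, sigma):
    """Algorithm 3.1 plus codeword extraction (streaming setting)."""
    S = np.vstack([S_train, S_test])
    m = S_train.shape[0]
    ell = int(np.ceil(k + k / eps))        # sketch size; assumes ell <= m
    B = np.zeros((ell, m))
    affinities = []                        # stored only in the streaming setting
    for point in S:
        w = gaussian_affinity(point, S_train, sigma)
        affinities.append(w)
        free = np.flatnonzero(~B.any(axis=1))
        B[free[0]] = w                     # insert into a zero row
        if free.size == 1:                 # B has no zero rows left
            _, s, Vt = np.linalg.svd(B, full_matrices=False)
            s = np.sqrt(np.maximum(s ** 2 - s[-1] ** 2, 0.0))
            B = s[:, None] * Vt
    # Assumption of this sketch: final V comes from an SVD of the sketch.
    _, _, Vt = np.linalg.svd(B, full_matrices=False)
    V_k = Vt[:k].T                         # first k right singular vectors
    codes = np.where(np.array(affinities) @ V_k >= 0, 1, -1)
    return B, codes


rng = np.random.default_rng(0)
S_train = rng.standard_normal((40, 5))
S_test = rng.standard_normal((100, 5))
B, codes = ssbc(S_train, S_test, k=4, eps=1.0, sigma=5.0)
```

In the online setting one would instead emit
`np.where(w @ V_k >= 0, 1, -1)` for each point using the most recent
`Vt`, and drop the `affinities` list, so that only the sketch and
the training set are kept.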


To explain the good performance of SSBC, we argue that under the
data model described in Section 3.1, the squared norms of the
columns of $W$ are within a $\mathrm{polylog}(n)$ factor of each
other. Using this fact, we show that a uniform sample of the columns
of $W$ is a good approximation to $W$. In what follows, let $C_i$,
$C_{max}$ and $C_{min}$ denote the squared norm of the $i$-th column
of $W$, and the maximum and minimum squared norm of any column of
$W$, respectively.

Lemma 3.1. Under the “sparsity” and “bounded doubling dimension”
assumptions:

$$
C_{max} / C_{min} \le (\log n)^{O(d_0)}
$$

Proof. First note that it is trivially true that $C_{min} \ge 1$,
since
$C_{min} \ge W_{i,i}^2 = \exp^2(-\|S_{i,:} - S_{i,:}\|^2) = \exp^2(0) = 1$.
We now upper bound $C_{max}$. Let $C_i$ denote the squared norm of
an arbitrary column of $W$, so that upper bounding $C_i$ would also
bound $C_{max}$. Let $S_{i,:}$ be the datapoint associated with
column $W_{:,i}$. We proceed by partitioning the points of $S$ into
those close to (similar) and far from (dissimilar) $S_{i,:}$, as
$P_c$ and $P_f$ respectively. Define
$P_f = \{S_{j,:} \in S \ \text{s.t.}\ W_{i,j} < \frac{1}{n}\}$ and
$P_c = \{S_{j,:} \in S \ \text{s.t.}\ W_{i,j} \ge \frac{1}{n}\}$.
Note that the contribution of $P_f$ to $C_i$ is at most
$|P_f| \cdot \frac{1}{n} \le 1$, and the contribution of $P_c$ to
$C_i$ is at most $|P_c| \cdot 1 \le |P_c|$. So we bound the size of
$P_c$. First we upper bound the distance of any point
$S_{j,:} \in P_c$ to the point $S_{i,:}$ as follows:

$$
W_{i,j} = \exp\left(-\|S_{i,:} - S_{j,:}\|^2/\sigma\right) \ge \frac{1}{n} .
$$

Therefore $\|S_{i,:} - S_{j,:}\|^2 \le \sigma \ln n$.

Now considering the sparsity condition together with the bounded
doubling dimension, we have that the number of points $S_{j,:}$
within $\sigma \ln n$ distance of $S_{i,:}$ is at most
$\left(\frac{\sigma \ln n}{0.1\,\sigma/\log n}\right)^{d_0} = (\log n)^{O(d_0)}$.
□


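We immediately get the following corollary as a consequence:

Corollary 3.1. It holds $\forall i$, $1 \le i \le n$, that

$$
\frac{1}{\mathrm{polylog}(n)} \cdot \frac{\|W\|_F^2}{n}
\le C_i \le
\mathrm{polylog}(n) \cdot \frac{\|W\|_F^2}{n} .
$$

Proof. For the upper bound, we have $n\,C_{min} \le \|W\|_F^2$, or
$C_{min} \le \frac{\|W\|_F^2}{n}$. But for arbitrary $C_i$, we have
$C_i \le \mathrm{polylog}(n)\,C_{min}$ and hence
$C_i \le \mathrm{polylog}(n)\,\frac{\|W\|_F^2}{n}$. The lower bound
on $C_i$ follows similarly using $C_{max}$. □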

4. ERROR ANALYSIS 

In this section, we prove our main result. Let
$W \in \mathbb{R}^{n \times n}$ be the exact affinity matrix of the
$n$ datapoints in $S$, where
$W_{i,j} = \exp(-\|S_{i,:} - S_{j,:}\|^2/\sigma)$. Let
$m = |S_{train}|$ be the size of the training set and
$\tilde{W} \in \mathbb{R}^{n \times m}$ be the rescaled affinity
matrix between all points in $S$ and $S_{train}$. Under the
assumption that $S_{train}$ is drawn at random, we can imagine
$\tilde{W}$ is a column sample drawn uniformly at random from $W$.
In the general case, column samples are only good matrix
approximations to $W$ if each column is drawn proportional to its
norm, which is not known in advance in the streaming setting.
However, we show that under our data model assumptions of Section
3.1 a uniform sample suffices. Define $W_{:,j^*}$ to be the column
of $W$ that gets sampled for the $j$-th column of $\tilde{W}$. (This
corresponds to a choice of $S_{j^*,:}$ as the $j$-th point in
$S_{train}$.) Now define the scaling of $\tilde{W}$ as
$\tilde{W}_{i,j} = \sqrt{n/m}\; W_{i,j^*}$. Define
$\hat{W} = \tilde{W} B^T B \tilde{W}^T$ as the approximated affinity
that could be constructed from $\tilde{W}$ and the output of SSBC,
i.e. $B \in \mathbb{R}^{\ell \times m}$.
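
The effect of the rescaling can be illustrated numerically. The
sketch below assumes the scaling factor is $\sqrt{n/m}$, the choice
under which the product of the rescaled column sample with its
transpose is an unbiased estimate of $W^2$ for uniform sampling with
replacement; the helper name `rescaled_sample` and the toy data are
likewise assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, sigma = 30, 10, 2.0
S = rng.standard_normal((n, 3))
sq = ((S[:, None, :] - S[None, :, :]) ** 2).sum(axis=-1)
W = np.exp(-sq / sigma)                    # exact affinity matrix


def rescaled_sample(W, cols):
    """Column sample of W scaled by sqrt(n/m) (assumed scaling)."""
    n_rows, n_cols = W.shape[0], len(cols)
    return np.sqrt(n_rows / n_cols) * W[:, cols]


# Exact expectation of one rescaled outer product under uniform sampling,
# computed by averaging over all n possible columns.
expected_outer = sum((n / m) * np.outer(W[:, j], W[:, j]) for j in range(n)) / n
expected_product = m * expected_outer      # sum over the m sampled columns

# One random draw, for comparison with the exact product W @ W.
cols = rng.integers(0, n, size=m)          # uniform, with replacement
W_tilde = rescaled_sample(W, cols)
rel_err = np.linalg.norm(W @ W - W_tilde @ W_tilde.T, 2) / np.linalg.norm(W, "fro") ** 2
```

`expected_product` coincides with `W @ W`, while `rel_err` is the
quantity that the analysis bounds for a single draw.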

We show that for
$m = \Omega\big(\frac{1}{\epsilon^2}\,\mathrm{polylog}(n)\,\log(1/\delta)\big)$
and $\ell = 2/\epsilon$, we have
$\|W^2 - \hat{W}\|_2 \le \epsilon \|W\|_F^2$ with probability at
least $1 - \delta$. In our proof we use the Bernstein inequality on
sums of zero-mean random matrices, which is stated below.

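Matrix Bernstein Inequality.

Let $E_1, \ldots, E_m \in \mathbb{R}^{n \times n}$ be independent
random matrices such that for all $1 \le i \le m$,
$\mathbb{E}[E_i] = 0$ and $\|E_i\|_2 \le \Delta$ for a fixed
constant $\Delta$. If we define the variance parameter as

$$
\sigma^2 := \max\left\{
\left\| \sum_{i=1}^{m} \mathbb{E}[E_i^T E_i] \right\|_2,\
\left\| \sum_{i=1}^{m} \mathbb{E}[E_i E_i^T] \right\|_2
\right\},
$$

then for all $t \ge 0$:

$$
\Pr\left[ \left\| \sum_{i=1}^{m} E_i \right\|_2 \ge t \right]
\le 2n \exp\left( \frac{-t^2}{3\sigma^2 + 2\Delta t} \right).
$$

* Our algorithm does not actually construct $\hat{W}$ and
$\tilde{W}$. Rather we use them as existential objects for our
theoretical analysis.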

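The lemma below bounds the spectral error between $\tilde{W}$ and
$W$.

Lemma 4.1. If $W \in \mathbb{R}^{n \times n}$ is the exact affinity
matrix of the points $S$ and $\tilde{W} \in \mathbb{R}^{n \times m}$
is the affinity matrix between the points in $S$ and $S_{train}$,
then for
$m = \Omega\big(\frac{1}{\epsilon^2}\,\mathrm{polylog}(n)\,\log(1/\delta)\big)$

$$
\| W^2 - \tilde{W}\tilde{W}^T \|_2 \le \epsilon \|W\|_F^2
$$

holds with probability at least $1 - \delta$.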

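Proof. Consider $m$ independent random variables
$E_i = \frac{1}{m} W^2 - \tilde{W}_{:,i} \tilde{W}_{:,i}^T$. We can
show $\mathbb{E}[E_i] = 0$ as follows:

$$
\begin{aligned}
\mathbb{E}[E_i]
&= \frac{1}{m} W^2 - \mathbb{E}\left[ \tilde{W}_{:,i} \tilde{W}_{:,i}^T \right] \\
&= \frac{1}{m} W^2 - \frac{1}{n} \sum_{j=1}^{n} \frac{n}{m}\, W_{:,j} W_{:,j}^T \\
&= \frac{1}{m} W^2 - \frac{1}{m} \sum_{j=1}^{n} W_{:,j} W_{:,j}^T \\
&= \frac{1}{m} W^2 - \frac{1}{m} W W^T = 0 .
\end{aligned}
$$

Note that the last equality is correct because $W$ is a symmetric
matrix, and therefore $W^2 = W W^T$. We can now evaluate
$\mathbb{E}[\tilde{W}\tilde{W}^T] = \sum_{i=1}^{m} \mathbb{E}[\tilde{W}_{:,i} \tilde{W}_{:,i}^T] = W^2$.
Using this result we bound $\|E_i\|_2$ as follows:

$$
\begin{aligned}
\|E_i\|_2
&= \left\| \frac{1}{m} W^2 - \tilde{W}_{:,i} \tilde{W}_{:,i}^T \right\|_2 \\
&= \left\| \frac{1}{m} \mathbb{E}[\tilde{W}\tilde{W}^T] - \tilde{W}_{:,i} \tilde{W}_{:,i}^T \right\|_2 \\
&\le \frac{1}{m} \left\| \mathbb{E}[\tilde{W}\tilde{W}^T] \right\|_2 + \left\| \tilde{W}_{:,i} \tilde{W}_{:,i}^T \right\|_2 \\
&\le \frac{1}{m} \mathbb{E}\left[ \|\tilde{W}\|_2^2 \right] + \|\tilde{W}_{:,i}\|^2 \\
&\le \frac{1}{m} \mathbb{E}\left[ \|\tilde{W}\|_F^2 \right] + \frac{n}{m} \|W_{:,i^*}\|^2 \\
&\le \frac{1}{m} \|W\|_F^2 + \frac{\mathrm{polylog}(n)}{m} \|W\|_F^2 \\
&= O\left( \frac{\mathrm{polylog}(n)}{m} \right) \|W\|_F^2 .
\end{aligned}
$$

Here the fourth line is achieved using Jensen's inequality on
expected values, which states $\|\mathbb{E}[X]\| \le \mathbb{E}[\|X\|]$
for any random variable $X$, and the bound on $\|W_{:,i^*}\|^2$
follows from Corollary 3.1. Therefore
$\Delta = \|W\|_F^2/m + \|W\|_F^2\,\mathrm{polylog}(n)/m$ for all
$E_i$'s.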

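In order to bound the variance parameter $\sigma^2$, first note that
due to the symmetry of the matrices $E_i$, its definition reduces to

$$
\sigma^2 = \left\| \sum_{i=1}^{m} \mathbb{E}[E_i^2] \right\|_2
\le \mathbb{E}\left[ \left\| \sum_{i=1}^{m} E_i^2 \right\|_2 \right]
\le m\, \mathbb{E}\left[ \|E_i^2\|_2 \right],
$$

where the last step follows since all the $E_i$ are identically
distributed random variables. We already have an upper bound on the
value $\|E_i\|_2$ may achieve, and hence the square of this upper
bounds $\mathbb{E}[\|E_i^2\|_2]$. We bound
$m\,\mathbb{E}[\|E_i^2\|_2]$ as follows:

$$
m\, \mathbb{E}\left[ \|E_i^2\|_2 \right]
\le m\, \|E_i\|_2^2
\le m\, O\left( \frac{\mathrm{polylog}(n)}{m} \|W\|_F^2 \right)^2
\le O\left( \frac{\mathrm{polylog}(n)}{m} \right) \|W\|_F^4 .
$$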

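Setting $M = \sum_{i=1}^{m} E_i = W^2 - \tilde{W}\tilde{W}^T$ and
using the Bernstein inequality with $t = \epsilon \|W\|_F^2$, we
obtain

$$
\begin{aligned}
\Pr\left[ \| W^2 - \tilde{W}\tilde{W}^T \|_2 \ge \epsilon \|W\|_F^2 \right]
&\le 2n \exp\left( \frac{-(\epsilon \|W\|_F^2)^2}
   {3 \|W\|_F^4\, \mathrm{polylog}(n)/m + 2\epsilon \|W\|_F^4\, (1 + \mathrm{polylog}(n))/m} \right) \\
&= 2n \exp\left( \frac{-\epsilon^2 m}
   {3\, \mathrm{polylog}(n) + 2\epsilon\, (1 + \mathrm{polylog}(n))} \right)
\le \delta .
\end{aligned}
$$

Taking the natural logarithm of both sides and inverting the ratios,
we get:

$$
\frac{3\, \mathrm{polylog}(n)}{\epsilon^2 m}
+ \frac{2\, (1 + \mathrm{polylog}(n))}{\epsilon m}
\le \ln^{-1}(2n/\delta) .
$$

Considering that $\epsilon \le 1$, we seek to bound:

$$
\frac{3\, \mathrm{polylog}(n)}{\epsilon^2 m}
+ \frac{2\, (1 + \mathrm{polylog}(n))}{\epsilon^2 m}
\le \ln^{-1}(2n/\delta) .
$$

Solving for $m$ we obtain that for
$m = \Omega\big(\frac{1}{\epsilon^2}\,\mathrm{polylog}(n)\,\log(1/\delta)\big)$,
the bound holds with probability at least $1 - \delta$. □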


Hence $\tilde{W}$ has a similar spectrum to $W$. In the lemma below,
we argue that the spectrum of $\tilde{W}$ can be captured well by
the sketch $B$. To this end we define
$\hat{W} = \tilde{W} B^T B \tilde{W}^T$ and show $\hat{W}$ is again
similar to $\tilde{W}$; intuitively, $\hat{W}$ approximates the
projection of $\tilde{W}$ onto the right singular vectors of $B$.

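Lemma 4.2. Let $\tilde{W}$ be the affinity matrix between the
datapoints in $S$ and $S_{train}$. Then for
$\hat{W} = \tilde{W} B^T B \tilde{W}^T$ and
$\ell = O(\frac{1}{\epsilon})$:

$$
\| \tilde{W}\tilde{W}^T - \hat{W} \|_2 \le \epsilon \|W\|_F^2
$$

Proof. We can bound $\|\hat{W} - \tilde{W}\tilde{W}^T\|_2$ as
follows:

$$
\begin{aligned}
\| \hat{W} - \tilde{W}\tilde{W}^T \|_2
&= \| \tilde{W} B^T B \tilde{W}^T - \tilde{W}\tilde{W}^T \|_2 \\
&= \| \tilde{W} (B^T B - \tilde{W}^T \tilde{W}) \tilde{W}^T \|_2 \\
&\le \| \tilde{W} \|_2\, \| B^T B - \tilde{W}^T \tilde{W} \|_2\, \| \tilde{W}^T \|_2 \\
&= \| B^T B - \tilde{W}^T \tilde{W} \|_2 \\
&\le \frac{1}{\ell} \| \tilde{W} \|_F^2 .
\end{aligned}
$$

And we can also bound

$$
\| \tilde{W} \|_F^2 = \sum_{i=1}^{m} \| \tilde{W}_{:,i} \|^2
\le \frac{\mathrm{polylog}(n)}{m}\, \|W\|_F^2 .
$$

Putting the two bounds together we get:

$$
\| \hat{W} - \tilde{W}\tilde{W}^T \|_2
\le \frac{\mathrm{polylog}(n)}{m \ell}\, \|W\|_F^2 .
$$

Since we already showed $m \ge \mathrm{polylog}(n)$ in Lemma 4.1,
setting $\ell = \frac{1}{\epsilon}$ suffices to complete the proof.
□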

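Theorem 4.1. Let $W \in \mathbb{R}^{n \times n}$ be the similarity
matrix of $S \in \mathbb{R}^{n \times d}$ and
$\hat{W} = \tilde{W} B^T B \tilde{W}^T$ be the weight matrix
constructed from $\tilde{W} \in \mathbb{R}^{n \times m}$ and
$B \in \mathbb{R}^{\ell \times m}$, where $\tilde{W}$ is the set of
columns sampled with replacement from $W$, and $B$ is the output of
Algorithm 3.1. Then for
$m = \Omega\big(\frac{1}{\epsilon^2}\,\mathrm{polylog}(n)\,\log(1/\delta)\big)$
and $\ell = O(\frac{1}{\epsilon})$:

$$
\| W^2 - \hat{W} \|_2 \le \epsilon \|W\|_F^2
$$

holds with probability $1 - \delta$.

Proof. Having the results of Lemmas 4.1 and 4.2, assuming $m$ to be
sufficiently large to meet the conditions of both lemmas, using the
triangle inequality, and rescaling $\epsilon$ to $\epsilon/2$ proves
the result. □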

We explain some informal intuition of what Theorem 4.1 implies. First, we can infer that $\|W - \sqrt{\hat{W}}\|_2$ is small. Now, writing the SVD of $B^TB$ as $U\Sigma U^T$, we get $\hat{W} = \bar{W}U\Sigma U^T\bar{W}^T = (\bar{W}U)\,\Sigma\,(\bar{W}U)^T$. The intuition then is that if $\hat{W}$ is similar to $W^2$, then $\sqrt{\hat{W}} \approx \bar{W}U$, which is just the projection of $\bar{W}$ onto the right singular space of $B$. This suggests that the right singular space of $B$ captures most of the spectrum of $W$, in the sense that a column sample of $W$ projected on the right singular space of $B$ and scaled appropriately recovers $W$ closely.

5. EXPERIMENTS 

Herein we describe an extensive set of experiments on a wide variety of large input datasets. We ran all algorithms under a common implementation framework using Matlab to have a fair basis for comparison.

We compared the efficiency and accuracy of our algorithm (SSBC) against well-known streaming binary encoding techniques, including "Multidimensional Spectral Hashing" (MDSH), "Locality Sensitive Hashing" (LSH) and "Spectral Hashing" (SH).

We also compare the accuracy of these algorithms against the exact solution of the binary coding problem when posed as a matrix optimization. As the exact solution, we compute the affinity matrix of the whole dataset $S_{total} = [S_{train}; S_{test}]$ and take its eigendecomposition. Let $W_{total} \in \mathbb{R}^{n\times n}$ denote the affinity matrix for $S_{total}$. If $W_{total} = U\Lambda U^T$ is the eigendecomposition of $W_{total}$, then the $i$-th row of the matrix $\mathrm{sign}(U_k)$ provides a binary code of length $k$ for the $i$-th datapoint in $S_{total}$. In our experiments, we considered two types of thresholding on the exact solution, namely deterministic rounding and randomized rounding. The deterministic rounding version is called "Exact-D" in the plots; it simply takes the sign of $U_k$. The randomized rounding version is called "Exact-R" in the plots; after computing $U_k$, it multiplies it by a random rotation matrix $R \in \mathbb{R}^{k\times k}$ and then takes the sign of the entries.
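The two exact baselines described above can be sketched in a few lines of Python. This is an illustrative sketch only, under two assumptions that the text does not spell out: $U_k$ is taken to be the eigenvectors of the $k$ largest eigenvalues, and the random rotation is drawn as the orthogonal factor of a QR decomposition of a Gaussian matrix. The function name `exact_codes` and the toy data are hypothetical.

```python
import numpy as np

def exact_codes(W_total, k, randomized=False, seed=0):
    """Binary codes from the eigendecomposition of a symmetric affinity matrix.

    randomized=False corresponds to the "Exact-D" description (sign of U_k);
    randomized=True corresponds to "Exact-R" (sign of U_k times a random
    orthogonal matrix). Both choices of U_k and of R are assumptions.
    """
    eigvals, eigvecs = np.linalg.eigh(W_total)   # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:k]          # indices of the k largest
    U_k = eigvecs[:, top]                        # n x k
    if randomized:
        rng = np.random.default_rng(seed)
        # Orthogonal k x k matrix from the QR factorisation of a Gaussian matrix.
        R, _ = np.linalg.qr(rng.standard_normal((k, k)))
        U_k = U_k @ R
    # Row i is the length-k code of datapoint i, with entries in {-1, +1}.
    return np.where(U_k >= 0, 1, -1)

# Toy usage on a small Gaussian affinity matrix.
rng = np.random.default_rng(1)
X = rng.standard_normal((12, 3))
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
W_total = np.exp(-sq_dists)
codes_d = exact_codes(W_total, k=4)
codes_r = exact_codes(W_total, k=4, randomized=True)
print(codes_d.shape, codes_r.shape)
```

Because the full $n \times n$ matrix must be held in memory and decomposed, this approach is only practical for small datasets.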

Datasets. 

We compare the performance of our algorithm on both synthetic and real datasets. Each dataset is divided into two subsets, $S_{train}$ and $S_{test}$, with the same number of dimensions and different numbers of datapoints. Table 1 lists all datasets along with some statistics about them. We refer to each set as an $n \times d$ matrix $A$, with $n$ datapoints and $d$ dimensions. The training set $S_{train}$ is kept small so that it easily fits into memory, while $S_{test}$ is a large stream of data whose datapoints are processed one by one by our algorithm.

As a synthetic dataset we used a multidimensional uniform distribution with d = 50 dimensions in which the t-th dimension v_t, 1 ≤ t ≤ d, has a uniform distribution in the range [0, (1/t)^]. In the spectral hashing algorithm, the authors argue that their learned eigenfunctions converge most sharply for a rectangle distribution and include experimental results on uniform distributions demonstrating this efficacy. We added such a dataset here so as to evaluate SSBC on a dataset model well suited to their algorithm.

DataSet    # Train    # Test    Dimension    Rank
PAMAP      100        21000     44           44
CBM        200        11000     18           16
Uniform    500        10000     50           50
Covtype    500        20000     54           53

Table 1: Dataset statistics.

We used three real-world datasets in our experiments. In each dataset, we uniformly sampled a small subset of the data at random and considered it as $S_{train}$, and used a subset of the remaining part as $S_{test}$. Information about the sizes of the training and test sets is provided in Table 1. The first real-world dataset was the well-known Covtype dataset, which contains information for predicting forest cover type from cartographic variables. The second was CBM, or "Condition Based Maintenance of Naval Propulsion Plants", a dataset generated from a simulator of a gas turbine propulsion plant. It contains 11934 datapoints in a d = 16 dimensional space.

The PAMAP dataset is a Physical Activity Monitoring dataset that contains data of 18 different physical activities (such as walking, cycling, playing soccer, etc.), performed by 9 subjects wearing 3 inertial measurement units and a heart rate monitor. The dataset contains 54 columns including a timestamp, an activity label (the ground truth) and 52 attributes of raw sensory data. In our experiments, we removed columns containing missing values and used a subset with d = 44 columns.

Metrics. 

We use the following three metrics to compare the accuracy of the discussed algorithms:

• Precision: The number of true similar datapoints returned by an algorithm over the total number of datapoints returned by the algorithm.

• Recall: The number of true similar datapoints returned by an algorithm over the correct number of similar datapoints.

• Mean Average Precision (MAP): The mean of the average precision scores for each test point.

Figure 2: Results on CBM dataset (axis label: Number of Bits).
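The three metrics can be computed as in the following minimal Python sketch. Precision and recall follow the definitions given in the list; for average precision the text gives no formula, so the usual information-retrieval definition (the mean of the precision values at the ranks where a truly similar point appears) is assumed here. All function names are hypothetical.

```python
import numpy as np

def precision_recall(retrieved, relevant):
    """retrieved, relevant: boolean masks over the candidate datapoints."""
    retrieved = np.asarray(retrieved, dtype=bool)
    relevant = np.asarray(relevant, dtype=bool)
    true_pos = np.sum(retrieved & relevant)
    precision = true_pos / retrieved.sum() if retrieved.any() else 0.0
    recall = true_pos / relevant.sum() if relevant.any() else 0.0
    return float(precision), float(recall)

def average_precision(ranking, relevant):
    """ranking: indices ordered from most to least similar (e.g. by Hamming
    distance); relevant: boolean mask of the truly similar datapoints."""
    hits = np.asarray(relevant, dtype=bool)[np.asarray(ranking)]
    if not hits.any():
        return 0.0
    ranks = np.arange(1, len(hits) + 1)
    precision_at_rank = np.cumsum(hits) / ranks
    # Average the precision over the ranks at which a relevant point occurs.
    return float(precision_at_rank[hits].mean())

def mean_average_precision(rankings, relevants):
    """Mean of the per-query average precision scores."""
    return float(np.mean([average_precision(r, m)
                          for r, m in zip(rankings, relevants)]))

# Toy query: 4 candidates, the algorithm returns the first two,
# while candidates 0 and 2 are the truly similar ones.
p, r = precision_recall([True, True, False, False],
                        [True, False, True, False])
ap = average_precision([0, 1, 2, 3], [True, False, True, False])
print(p, r, ap)
```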

We have used the Gaussian function $w(p,q) = \exp(-\|p - q\|^2/\sigma)$ to compute the affinity between any two datapoints $p$ and $q$. We set $\sigma$ in each dataset to the average distance of all training datapoints to their 30-th nearest neighbour, and use this value as the threshold in both Hamming and Euclidean space to designate whether two points are similar. We refer to this parameter as $\sigma_{30}$. We have used $\sigma_{30}$ in all the experiments involving the "precision" and "recall" metrics. For the Mean Average Precision (MAP) metric, we consider 3 different similarity levels: $\sigma_{30}$, the average of all pairwise distances in the training set ($\sigma_{all}$), and $\sigma_{30}/4$. In all cases, we set the $\sigma$ parameter in our Gaussian weight kernel equal to our similarity threshold for classifying points as near. The number of bits we use ranges from k = 20 to k = 50 in increments of 5.
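As an illustration of how the kernel width and the affinities could be computed, here is a minimal Python sketch. It assumes that the kernel divides the squared distance by $\sigma$ itself, as the formula reads in the text, and that a point is not counted as its own neighbour when locating the 30-th nearest neighbour; the function names and the toy data are hypothetical.

```python
import numpy as np

def pairwise_sq_dists(X, Y):
    """Squared Euclidean distances between the rows of X and the rows of Y."""
    d2 = (X ** 2).sum(axis=1)[:, None] + (Y ** 2).sum(axis=1)[None, :] - 2.0 * X @ Y.T
    return np.maximum(d2, 0.0)  # clip tiny negatives caused by rounding

def sigma_knn(S_train, k=30):
    """Average distance from each training point to its k-th nearest neighbour.

    Assumption: the point itself (distance 0) is excluded, so after sorting
    each row the k-th neighbour sits at column index k.
    """
    dists = np.sqrt(pairwise_sq_dists(S_train, S_train))
    return float(np.sort(dists, axis=1)[:, k].mean())

def gaussian_affinity(X, Y, sigma):
    """Affinity w(p, q) = exp(-||p - q||^2 / sigma) for all pairs of rows."""
    return np.exp(-pairwise_sq_dists(X, Y) / sigma)

# Toy usage: affinities between a few streamed points and a small training set.
rng = np.random.default_rng(0)
S_train = rng.standard_normal((100, 5))
stream_points = rng.standard_normal((3, 5))
sigma_30 = sigma_knn(S_train, k=30)
W_rows = gaussian_affinity(stream_points, S_train, sigma_30)
print(sigma_30, W_rows.shape)
```

Each streamed point only needs its distances to the training set, so one row of affinities can be produced per arriving datapoint without storing the stream.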

As we observe in the precision and recall plots of the results figures, SSBC performs exceptionally well on precision, providing very few "false positives" compared to the other algorithms and consistently providing the highest precision of the methods evaluated. On the recall metric, SSBC also provides the best results of all the approaches evaluated. In both cases, this edge in performance is maintained over all tested codeword lengths k ∈ [20, 50]. Combining these two plots we get the precision-recall comparison (the last plot in each of the above-mentioned figures), which shows that SSBC forms an almost 45-degree line in all figures; that is, its mistake rate does not increase as it returns more candidates for nearest neighbours (i.e. at high recall).

In a separate set of experiments, we compared the accuracy of all algorithms with the exact methods. This time, in order to allow the exact algorithms to load the whole n by n weight matrix in RAM, we used a much smaller test set. The sizes of the test and training sets for these experiments are given in the caption of Figure 5. As we see in this plot, SSBC secures higher mean average precision and recall than the exact methods, "Exact-D" and "Exact-R", which solve the matrix optimization by applying an SVD over the entire dataset. This is likely because maintaining a column sample of the weight matrix through a training set helps prevent overfitting errors.

Figure 4: Results on Covtype dataset.

Figure 5: Comparing algorithms with Exact methods on CBM (first row) and PAMAP (second row) datasets. Training set for each dataset was of size 200 and 100 and test set was of size 1000 and 3000, respectively.